Random Forests





Kerry Back

Types of forests

  • Random forest
  • Boosting
    • Gradient boosting
    • Extreme gradient boosting (XGBoost)
    • Adaptive boosting (AdaBoost)

We’ll use random forests.

Random forest

  • Create bootstrap samples, each the same size as the original, by drawing observations randomly with replacement from the original sample.
  • Fit a tree to each bootstrap sample.
  • Average the trees' predictions to get the forest's prediction.
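
The three steps can be sketched by hand with individual decision trees. This is a minimal illustration on synthetic data (the data, the number of trees, and the names n_trees, trees, and x_new are assumptions, not part of the course example). It omits the random selection of features at each split that scikit-learn's forests support through max_features.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration; not the course dataset)
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

n_trees = 50
trees = []
for i in range(n_trees):
    # Step 1: draw n row indices with replacement (a bootstrap sample of the same size)
    idx = rng.integers(0, n, size=n)
    # Step 2: fit one tree to that bootstrap sample
    tree = DecisionTreeRegressor(max_depth=4, random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Step 3: the forest prediction is the average of the tree predictions
x_new = np.array([[0.1, 0.4]])
tree_preds = np.array([t.predict(x_new)[0] for t in trees])
forest_pred = tree_preds.mean()
print(forest_pred)
```

Individual trees disagree because each sees a different bootstrap sample; averaging smooths out that variation.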

Predictions

  • In each tree, answer the yes/no questions to find the leaf for an observation.
  • The prediction for that tree is
    • the leaf mean for regression
    • the leaf class probabilities for classification
  • Repeat for each tree in the forest and average the predictions.
  • For classification, the class with the highest average probability is the prediction.
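
The averaging rule can be checked against scikit-learn directly, since a fitted forest exposes its trees in estimators_. The sketch below uses synthetic data (an assumption; the names y_reg, y_cls, and x_new are made up for illustration) and compares hand-averaged tree predictions with the forest's own output.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y_reg = X[:, 0] + rng.normal(scale=0.5, size=300)
y_cls = (y_reg > 0).astype(int)
x_new = X[:5]

# Regression: average the leaf means across trees
reg = RandomForestRegressor(n_estimators=20, max_depth=4, random_state=0).fit(X, y_reg)
by_hand = np.mean([t.predict(x_new) for t in reg.estimators_], axis=0)
print(np.allclose(by_hand, reg.predict(x_new)))

# Classification: average the leaf class probabilities across trees,
# then pick the class with the highest average probability
clf = RandomForestClassifier(n_estimators=20, max_depth=4, random_state=0).fit(X, y_cls)
avg_proba = np.mean([t.predict_proba(x_new) for t in clf.estimators_], axis=0)
labels = clf.classes_[avg_proba.argmax(axis=1)]
print(np.allclose(avg_proba, clf.predict_proba(x_new)))
print((labels == clf.predict(x_new)).all())
```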

Example

  • Random forest regression for ranks.
  • Get roeq and mom12m as before, and define rnk.
  • Could also predict the return or a numerical class (0, 1, 2) instead of the rank.

Define and fit model

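from sklearn.ensemble import RandomForestRegressor

# Predictors and target (the rank defined above)
X = data[["roeq", "mom12m"]]
y = data["rnk"]

# max_depth=4 keeps each tree shallow; random_state makes the fit reproducible
model = RandomForestRegressor(
    max_depth=4,
    random_state=0
)
model.fit(X, y)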

Things to do

  • How well does it fit? Calculate \(R^2\) with model.score(X,y).
  • How important is each of the predictors? Calculate with model.feature_importances_.
  • Make a prediction for a new observation with model.predict.
  • Sort on predictions. Do portfolios of stocks with high predictions have high returns?

R-squared

model.score(X,y)
0.12891622261119873
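
For a regressor, score returns the usual R-squared, one minus the ratio of the residual sum of squares to the total sum of squares. A minimal check on synthetic data (the data and the names ss_res, ss_tot, and r2_by_hand are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration; not the course dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = X[:, 0] + rng.normal(size=400)

model = RandomForestRegressor(max_depth=4, random_state=0).fit(X, y)
pred = model.predict(X)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_by_hand = 1 - ss_res / ss_tot
print(r2_by_hand, model.score(X, y))
```

Because the score is computed on the same data used for fitting, it is an in-sample measure.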


Importance of features

model.feature_importances_
array([0.45752406, 0.54247594])
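
The importances come back in the same order as the columns of X and are normalized to sum to one, so it can help to attach the column names. A small sketch on synthetic data (the values are made up; only the column names mirror the example):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in: column names mirror the slides, values are made up
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(400, 2)), columns=["roeq", "mom12m"])
y = 0.3 * X["roeq"] + 0.7 * X["mom12m"] + rng.normal(scale=0.5, size=400)

model = RandomForestRegressor(max_depth=4, random_state=0).fit(X, y)

# Label each importance with its column name
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
print(importances.sum())
```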


Make a prediction

import pandas as pd
# The model was fit on a DataFrame, so pass a DataFrame with the same column names
x = pd.DataFrame({"roeq": [0.1], "mom12m": [0.4]})
model.predict(x)
array([0.50320694])

Sorts

Sort based on predictions and compute portfolio returns (in sample):

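import pandas as pd
# In-sample predicted ranks
prnk = model.predict(X)
# Five equal-sized groups by predicted rank (1 = lowest, 5 = highest)
data["quintile"] = pd.qcut(prnk, 5, labels=range(1, 6))
rets = data.groupby("quintile").ret.mean()
rets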
quintile
1    0.012690
2    0.013577
3    0.091312
4    0.093226
5    0.332981
Name: ret, dtype: float64

Still to do

  • We need to evaluate performance out of sample.
    • Include prediction as part of a portfolio strategy.
    • Backtest the strategy.
  • Fit on multiple months of data (in a backtest, only months before the prediction date).
  • Include more predictors.
  • Implement a strategy for choosing max_depth and deciding whether to predict returns, ranks, classes, or …
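
One possible way to choose max_depth and get an out-of-sample score is cross-validated grid search on a training set, followed by scoring on held-out data. This is only a sketch under stated assumptions: the data is synthetic, the candidate depths are arbitrary, and the split is random, whereas a backtest would train on past months only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data (an assumption for illustration; not the course dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 2))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=600)

# Hold out 25% of the observations for out-of-sample evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Try a few depths (arbitrary candidates) with 3-fold cross-validation
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid={"max_depth": [2, 4, 6]},
    cv=3,
)
search.fit(X_train, y_train)

print(search.best_params_)
# R-squared of the best model on data it was not trained on
test_r2 = search.score(X_test, y_test)
print(test_r2)
```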

Saving and loading models

  • Training models and using models are separate activities.
  • Training may occur on a different computer, be done by a different team, take place over several days, …
  • Trained models can be saved and shared, then loaded when needed.

from joblib import dump, load
dump(model, "forest1.joblib")

  • Later:

forest = load("forest1.joblib")
forest.predict(x)